American Journal of Epidemiology — Latest Matching Preprints

1

Mechanism Matters: A Monte Carlo Evaluation of Estimator Validity and Collider Bias in Environmental Mixture Epidemiology

Obeng-Gyasi, E.

2026-05-26 epidemiology 10.64898/2026.05.25.26354044 medRxiv

Top 0.1%

52.9%

Background: Mixture epidemiology deploys sophisticated estimators (Bayesian kernel machine regression with causal mediation analysis [BKMR-CMA], quantile G-computation [QGC], and parametric G-computation) alongside conventional regression. Comparative evaluations have assumed additive, non-mediated data-generating processes, leaving the conditions under which estimator choice determines causal validity uncharacterized. Methods: We developed a simulation framework using military-relevant exposure distributions (metals, per- and polyfluoroalkyl substances [PFAS], polychlorinated biphenyls [PCBs]) and allostatic load (AL) across three deployment tiers, with parameters drawn from military occupational health and contamination literature. Four data-generating processes were specified as directed acyclic graphs: direct effects with confounding (M1), full mediation through AL (M2), synergistic AL-exposure interaction (M3), and collider structure (M4). We evaluated ordinary least squares (OLS), QGC, G-computation, and BKMR-CMA on bias, root mean squared error, and 95% confidence interval coverage across 500 Monte Carlo replications at n = 500 and n = 1,000. Results: No estimator dominated across all mechanisms. Under M1, OLS and G-computation produced near-identical modest positive bias; BKMR-CMA achieved lower root mean squared error through kernel shrinkage. Under M2, BKMR-CMA exhibited severe positive bias for AL (mean bias = +0.579 SD units; coverage = 32.8%). Under M3, BKMR-CMA was the only estimator achieving nominal 95% coverage for AL (95.2%), while regression-based approaches fell to 83.6%. Under M4, G-computation produced persistent bias and near-zero coverage for lead, reflecting structural non-identification. Conclusions: Estimator validity is fundamentally mechanism-dependent. Researchers should base estimator choice on explicit causal assumptions about whether AL functions as confounder, mediator, moderator, or collider, particularly in military and occupational cohorts. We provide a mechanism-to-estimator mapping for applied researchers.

2

Bayesian joint modelling of antibody kinetics and test-negative vaccine effectiveness to characterise hybrid immunity across epidemic waves

Benammar, A.

2026-04-27 epidemiology 10.64898/2026.04.25.26351732 medRxiv

Top 0.1%

51.5%

Vaccine effectiveness against symptomatic SARS-CoV-2 infection varies over time and across epidemic waves. This variation can reflect waning immunity, immune escape by emerging variants, exposure heterogeneity, and differences in previous infection history. Test-negative case-control designs are widely used to monitor vaccine effectiveness, while longitudinal serological studies describe antibody trajectories after vaccination and infection. These evidence streams are often analysed separately. This manuscript presents a simulation-based Bayesian joint modelling framework that links individual-level antibody kinetics to test-negative vaccine effectiveness estimates across successive epidemic waves. Hybrid immunity is represented as the combined effect of vaccination and infection history, with latent antibody titres following a boost-and-decay process after each immunising event. A variant-specific titre-protection curve maps latent antibody levels to the risk of symptomatic infection. The framework is intended to illustrate how apparent changes in vaccine effectiveness may be decomposed into components related to waning, immune escape, and exposure heterogeneity. Using fully synthetic data calibrated to plausible vaccination schedules, infection histories, assay variability, and epidemic-wave structures, the model is evaluated in three simulation studies. The simulations illustrate that joint modelling can recover broad features of the assumed titre-protection relationship under idealised conditions and can separate waning from variant-specific shifts when the data-generating process is correctly specified. The results are not presented as validation on real-world surveillance data. Instead, they provide a transparent methodological proof of concept and identify assumptions that would need to be assessed before applying the framework to linked serological and test-negative datasets.
Author declarations: This manuscript reports a methodological simulation study. All individual-level data used in the manuscript are synthetic. No human participants, patient records, biological samples, or identifiable data were used. No ethics approval was required for the analyses presented here. The author declares no competing interests. This study did not receive external funding.

3

Capturing infant and child growth dynamics with P-splines mixed effects models

Hernandez, M. A.; Li, Z.; Cole, T.; Ong, Y. Y.; Tilling, K.; Elhakeem, A.

2025-10-24 epidemiology 10.1101/2025.10.22.25338570 medRxiv

Top 0.1%

51.2%

Investigating early life growth dynamics is crucial for a more comprehensive understanding of the developmental origins of obesity. Spline methods based on basis splines (B-splines) provide excellent flexibility for modelling complex nonlinear growth patterns, but they are prone to overfitting. To ensure good fit and avoid overfitting, B-splines can be extended by adding a penalty term to control their flexibility, resulting in what are commonly known as penalized B-splines (P-splines). Despite their strengths, P-splines are not yet widely used in epidemiology, partly due to a lack of practical guidance. This paper provides an illustrative guide to using P-spline linear mixed effects models to examine early life growth trajectories and estimate key growth features in longitudinal studies. After detailing P-spline theory and model fitting, we apply the method to repeated measurements of height, weight, and body mass index (BMI) up to age 10 years in a Southeast Asian birth cohort. We estimated infant growth velocity, and magnitude and timing of infant peak BMI and childhood rebound BMI, and explored sex differences, intercorrelations, and associations with prenatal factors. In our cohort, infant peak growth velocity was higher in boys than girls, ages of peak and rebound BMI had a negligible correlation, and greater birth length was associated with lower infant height velocity and higher weight velocity. We discuss practical considerations, alternative modelling approaches and provide recommendations for research. P-splines simplify the knot selection process, making them a valuable approach for growth modelling. R library, code and datasets are provided to accelerate uptake.

4

Interpreting Health Differences between Self-reported Black and White Children in U.S.: Insights from a Methodological Perspective

Yang, F. N.; Duyn, J. H.; Xie, W.

2024-10-02 public and global health 10.1101/2024.10.01.24314712 medRxiv

Top 0.1%

50.8%

Understanding health differences among racial groups in child development is crucial for addressing inequalities that may affect various aspects of a child's life. However, factors such as household and neighborhood socioeconomic status (SES) often covary with health differences between races, making it challenging to accurately reveal these differences using conventional covariate-control methods such as multiple regression. Alternative methods, such as Propensity Score Matching (PSM), may provide better covariate control. Supporting this notion, we found that PSM is more sensitive than regression-based methods in detecting health differences between self-reported Black and White children across a wide range of behavioral and neural measurements in the ABCD study (5,636 White, 1,350 Black). Puberty status, an index of physical maturation, emerged as the largest difference between races and mediated the health differences between races on the majority of behavioral and neural variables. These findings highlight the importance of controlling for pubertal status and using more effective covariate-control methods to accurately represent health differences between Black and White children.

5

Cluster-weighted modified Poisson regression for estimating risk ratios in longitudinal data with informative cluster sizes

Bather, J. R.; Anyaso-Samuel, S.; Chen, Y.; Elliott, L.; Bennett, A. S.; Goodman, M. S.

2025-05-25 epidemiology 10.1101/2025.05.23.25328253 medRxiv

Top 0.1%

50.6%

Variation in binary outcomes over time by cluster size arises across various biomedical disciplines, including reproductive health, dental medicine, and psychiatric epidemiology. This study formally integrates modified Poisson regression with cluster-weighted generalized estimating equations (MP-CWGEE) for computing risk ratios in longitudinal studies with informative cluster sizes. Using a comprehensive Monte Carlo simulation study, we empirically evaluated MP-CWGEE's statistical properties against alternative modeling approaches: MP-GEE, log-binomial CWGEE (LB-CWGEE), and log-binomial GEE (LB-GEE). We conducted 1,000 simulations across varying sample sizes, risk ratios, and informativeness degrees. MP-CWGEE demonstrated superior performance in model convergence, empirical bias, average estimated standard error, coverage, and Type 1 error control. While LB-CWGEE showed comparable results, its convergence rates were slightly inferior. The benefits of cluster-weighted models (MP-CWGEE and LB-CWGEE) over unweighted models (MP-GEE and LB-GEE) were pronounced in scenarios with informative cluster sizes. We demonstrated MP-CWGEE's practical application to a cohort study of people who used illicit opioids in New York City. We also provided implementation code for R, Stata, and SAS to facilitate wider adoption of the MP-CWGEE approach.

6

A two-step penalization and shrinkage approach for binary response data that is jointly separated and correlated: The effects of social networks on diarrheal disease

Hegde, S.; Eisenberg, J. N.; Beesley, L. J.; Mukherjee, B.

2024-03-18 epidemiology 10.1101/2024.03.13.24304191 medRxiv

Top 0.1%

40.8%

Epidemiologic data often violate common modeling assumptions of independence between subjects due to study design. Statistical separation is also common, particularly in the study of rare binary outcomes. Statistical separation for binary outcomes occurs when regions of the covariate space have no variation in the outcome, and separation can negatively impact the validity of logistic regression model parameters. When data are correlated, we generally use multi-level modeling for parameter estimation, and statistical approaches have also been developed for handling statistical separation. Approaches for analyzing data with both separation and complex correlation, however, are not well known. Extending prior work, we demonstrate a two-stage Bayesian modeling approach to account for both separated and highly correlated data through a motivating example examining the effect of social ties on Acute Gastrointestinal Illness (AGI) in rural Ecuador. The two-stage approach involves fitting a Bayesian hierarchical model to account for correlation using priors derived from parameter estimates from a Firth-corrected logistic regression model to account for separation. We compare estimates from the two-stage approach to standard regression methods that only account for either separation or correlation. Our results demonstrate that correctly accounting for separation and correlation when both are present can potentially provide better inference.

7

Application of the Adaptive Validation Design to estimate the association between transmasculine/transfeminine status and self-inflicted injury among transgender and gender-nonconforming children and adolescents.

Collin, L. J.; MacLehose, R. F.; Ahern, T. P.; Goodman, M.; Lash, T. L.

2020-02-23 epidemiology 10.1101/2020.02.20.20024182 medRxiv

Top 0.1%

40.5%

Background: An internal validation substudy compares an imperfect measurement of a variable with a gold standard measurement in a subset of the study population. Validation data permit calculation of a bias-adjusted estimate, expected to equal the association that would have been observed had the gold standard measurement been available for the entire study population. Guidance on optimal sampling of participants to include in validation substudies has not considered monitoring validation data as they accrue. In this paper, we develop and apply the framework of Bayesian monitoring to determine when sufficient validation data have been collected to yield a bias-adjusted estimate of association with a prespecified level of precision. Methods: We demonstrate the utility of this method using the Study of Transition, Outcomes and Gender, a cohort study of transgender and gender non-conforming children and adolescents. Transmasculine and transfeminine status were determined from the gender code in the electronic medical record at cohort enrollment. This status is known to be misclassified because it can indicate either gender identity or sex recorded at birth. Our interest is in the association between transmasculine and transfeminine status and self-inflicted injury. To address possible exposure misclassification, we demonstrate the method's ability to determine when sufficient validation data have been collected to calculate a bias-adjusted estimate of association that is less than 80% greater than the precision of the conventional estimate. Results: In the conventional age-adjusted analysis, we observed that transmasculine children and adolescents were 1.80-fold more likely to inflict self-harm than transfeminine youths (95% CI 1.27, 2.55). Using the adaptive validation approach, 200 cohort members were required for validation to yield a bias-adjusted estimate of OR=3.03 (95% CI 1.76, 5.56), which was similar to the bias-adjusted estimate using complete validation data (OR=2.63, 95% CI 1.67, 4.23). Conclusions: Our method provides a novel approach to effective and efficient estimation of classification parameters as validation data accrue. This method can be applied within the context of any parent epidemiologic study design, and modified to meet alternative criteria given specific study or validation study objectives.

8

Selection bias is unlikely to fully explain the protective effect of childhood adiposity on breast cancer risk.

Power, G. M.; Sanderson, E.; Davey Smith, G.; Hemani, G.

2025-09-12 epidemiology 10.1101/2025.09.10.25335479 medRxiv

Top 0.1%

39.9%

Background: Higher adiposity in early life has consistently been associated with a reduced risk of breast cancer in later life, with Mendelian randomization (MR) studies supporting a potential causal effect. However, concerns have been raised that selection bias, particularly collider stratification due to selective participation or survival, may induce spurious protective effects. Methods: We used a triangulation framework combining empirical analyses and simulations to evaluate whether selection-induced bias could plausibly explain the inverse effect of early life body size on breast cancer risk. First, we re-examined proxy-genotype MR analyses and conducted family-based simulations to assess whether attenuation in relative-based estimates could arise without selection bias. Second, we performed multivariable MR analyses of parental survival to evaluate survival-related selection mechanisms. Third, we conducted extensive simulations under a null causal model to quantify the magnitude of bias introduced by selection under a range of plausible and extreme scenarios, including interaction-driven selection. Results: Attenuation in proxy-genotype MR estimates was reproduced in simulations without selection bias, indicating that this pattern does not provide evidence for selection bias. Multivariable MR analyses of parental survival indicated that survival differences are primarily driven by adulthood, not childhood, adiposity, providing little support for survival-related selection acting through childhood body size. In simulation analyses, additive selection produced minimal bias, while interaction-driven selection generated increasing distortion; however, even under extreme scenarios, the magnitude of bias was insufficient to replicate the observed protective effect. When selection operated through adulthood body size, bias was confined largely to adulthood estimates. Across all scenarios, the joint pattern of univariable and multivariable MR findings was not reproduced under selection alone. Conclusions: Although selection bias can influence MR estimates, our findings suggest that plausible selection mechanisms are unlikely to fully explain the observed inverse effect of early life adiposity on breast cancer risk. These results support a causal interpretation of the protective effect and highlight the value of triangulating evidence across complementary approaches when evaluating bias in lifecourse MR.

9

Biases in Attribution Methods for Norovirus and Rotavirus Diarrhea

Chen, D.; Shioda, K.; Brouwer, A.; Kraay, A.; Handel, A.; Lopman, B.; McQuade, E. R.; Nelson, K.

2025-10-27 epidemiology 10.1101/2025.10.24.25338730 medRxiv

Top 0.1%

39.7%

Background: The estimate of diarrhea burden attributed to a specific enteric pathogen (the population attributable fraction, PAF) depends on the specific calculation method. Two conventional methods are commonly used to estimate the PAF for enteric infections: the "detection-as-etiology" (DE) method, which defines the PAF as the pathogen prevalence in diarrheal cases; and the "odds-ratio" (OR) method, which expresses the PAF as a function of the OR between pathogen detection and diarrhea. A third, less frequently used method uses the risk ratio (RR) to quantify the strength of infection. Methods: We compared each conventional PAF (DE, OR, or RR PAF) to a model-based (MB) PAF, derived from a transmission model of enteric infection, and defined bias as the crude difference from this "true" MB PAF. We fitted the transmission model to site-specific qPCR data for norovirus and rotavirus detection from MAL-ED (an eight-country birth cohort studying enteric infections) and used the equilibrium states to calculate the MB PAF. Results: For both pathogens, the OR and RR biases were small at all sites (ranging from -5% to +3%), whereas the DE method consistently overestimated the PAF and its bias was the largest of the conventional methods. Conclusions: Our mechanistic model provides an independent alternative to conventional methods, quantifying pathogen-specific enteric burden and the biases in those methods. Our model suggests the DE PAF estimates are consistently biased, and validates the OR and RR methods as feasible, low-bias measures for quantifying enteric burden.
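
As a rough illustration of why the two conventional methods can disagree, the sketch below computes a DE PAF and an OR PAF from the same case-control counts. It assumes the common case-based form PAF = p_cases x (1 - 1/OR) for the OR method; the abstract only says the PAF is "a function of the OR", so this exact formula, the function names, and the counts are illustrative assumptions rather than the authors' implementation.

```python
def paf_detection_as_etiology(cases_pos, cases_total):
    """DE PAF: pathogen prevalence among diarrheal cases."""
    return cases_pos / cases_total


def paf_odds_ratio(cases_pos, cases_total, controls_pos, controls_total):
    """OR PAF, assuming the case-based form p_cases * (1 - 1/OR)."""
    a = cases_pos                      # cases with pathogen detected
    b = cases_total - cases_pos        # cases without detection
    c = controls_pos                   # non-diarrheal controls with detection
    d = controls_total - controls_pos  # controls without detection
    odds_ratio = (a * d) / (b * c)
    p_cases = cases_pos / cases_total
    return p_cases * (1.0 - 1.0 / odds_ratio)


# Hypothetical counts: 30 of 100 cases and 10 of 100 controls test positive.
de = paf_detection_as_etiology(30, 100)
or_paf = paf_odds_ratio(30, 100, 10, 100)
print(f"DE PAF = {de:.3f}, OR PAF = {or_paf:.3f}")
```

With any detection among controls, the OR in this formula is finite, so the OR PAF is strictly smaller than the DE PAF. This matches the direction of the DE overestimation the abstract reports.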

10

Bias from small-count suppression in county-level cancer disparity estimates: a calibrated simulation study

Gahan, K.

2026-06-08 epidemiology 10.64898/2026.06.05.26355021 medRxiv

Top 0.1%

38.0%

Background. Area-level cancer disparities are routinely estimated from public county data in which rates based on small counts (fewer than 16 cases or deaths) are suppressed. Analysts typically drop suppressed counties (complete-case analysis). Because suppression depends on case counts tied to population size and demographic composition, this missingness may be informative, but its effect on the disparity estimate has not, to our knowledge, been quantified. Methods. In a cross-sectional ecological study of 3,143 U.S. counties (analytic sample 3,018 with computable exposure) using one frozen public release of NCI State Cancer Profiles incidence and mortality data and ACS 2018-2022 5-year data, we estimated the most- versus least-deprived ICE(race+income) quintile rate ratio (RR) and rate difference for female breast, stomach, and cervix cancers under four suppression-handling methods: complete-case, available-case, bounding, and model-based small-area estimation. We characterized which counties were erased, and, following the ADEMP framework, ran a Monte Carlo simulation (1,000 replicates per cell; Monte Carlo standard error of bias approximately 0.0025) calibrated to the release to measure bias against a known truth. Analyses were pre-registered. Results. The suppressed fraction rose with rarity: 7.4% of counties for breast, 61.3% for stomach, and 75.7% for cervix incidence. Suppression was concentrated in the most-deprived quintile (cervix, 81.8% suppressed vs 63.8% least-deprived) and overwhelmingly removed rural rather than minority residents (cervix: 81% of the rural but 9% of the minority population erased). For breast (little suppression) the RR was 0.87 (95% CI 0.85-0.89) and identical across methods; for cervix incidence the complete-case RR (1.56) exceeded the model-based estimate (1.50), and for cervix mortality (91% suppressed) complete-case (1.86) exceeded model-based (1.56) by 16% with a wide bounding interval (1.88-2.62). In calibrated simulation, population-weighted complete-case bias was small (less than 2%) at the observed deprivation-county-size correlation and grew with rarity, threshold, and unweighted aggregation; its direction was conditional, becoming positive (over-estimation) as deprived counties became smaller. Conclusions. Complete-case handling of suppressed counties over-estimates rare-cancer area disparities relative to methods that retain them, while silently erasing most of the rural and most-deprived communities the estimate is meant to represent. The effect is negligible for common cancers and grows with rarity. Public-data disparity analyses should report the suppressed fraction and use bounded or model-based estimates by default. Keywords: cancer disparities; small-count suppression; Index of Concentration at the Extremes; informative missingness; small-area estimation; rural health.

11

Reducing Information and Selection Bias in EHR-Linked Biobanks via Genetics-Informed Multiple Imputation and Sample Weighting

Salvatore, M.; Kundu, R.; Du, J.; Friese, C. R.; Mondul, A. M.; Hanauer, D. A.; Lu, H.; Pearce, C. L.; Mukherjee, B.

2024-10-29 epidemiology 10.1101/2024.10.28.24316286 medRxiv

Top 0.1%

37.9%

Electronic health records (EHRs) are valuable for public health and clinical research but are prone to many sources of bias, including missing data and non-probability selection. Missing data in EHRs is complex due to potential non-recording, fragmentation, or clinically informative absences. This study explores whether polygenic risk score (PRS)-informed multiple imputation for missing traits, combined with sample weighting, can mitigate missing data and selection biases in estimating disease-exposure associations. Simulations were conducted for missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) conditions under different sampling mechanisms. PRS-informed multiple imputation showed generally lower bias, particularly when combined with sample weighting. For example, in biased samples of 10,000 with exposure and outcome MAR data, PRS-informed imputation had lower percent bias (3.8%) and better coverage rate (0.883) compared to PRS-uninformed (4.5%; 0.877) and complete case analyses (10.3%; 0.784) in covariate-adjusted, weighted, multiple imputation scenarios. In a case study using Michigan Genomics Initiative (n=50,026) data, PRS-informed imputation aligned more closely with a sample-weighted All of Us-derived benchmark than analyses ignoring missing data and selection bias. Researchers should consider leveraging genetic data and sample weighting to address biases from missing data and non-probability sampling in biobanks.

12

Design and Estimation for the Population Prevalence of Infectious Diseases

Oh, E. J.; Mikytuck, A.; Lancaster, V.; Goldstein, J.; Keller, S.

2021-02-08 epidemiology 10.1101/2021.02.05.21251231 medRxiv

Top 0.1%

33.6%

Understanding the prevalence of infections in the population of interest is critical for making data-driven public health responses to infectious disease outbreaks. Accurate prevalence estimates, however, can be difficult to calculate due to a combination of low population prevalence, imperfect diagnostic tests, and limited testing resources. In addition, strategies based on convenience samples that target only symptomatic or high-risk individuals will yield biased estimates of the population prevalence. We present Bayesian multilevel regression and poststratification models that incorporate probability sampling designs, the sensitivity and specificity of a diagnostic test, and specimen pooling to obtain unbiased prevalence estimates. These models easily incorporate all available prior information and can yield reasonable inferences even with very low base rates and limited testing resources. We examine the performance of these models with an extensive numerical study that varies the sampling design, sample size, true prevalence, and pool size. We also demonstrate the relative robustness of the models to key prior distribution assumptions via sensitivity analyses.
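
To make the role of test sensitivity and specificity concrete, here is a minimal sketch of the classical Rogan-Gladen correction. This is a simple frequentist stand-in chosen for illustration, not the Bayesian multilevel regression and poststratification model the abstract describes, and the function name and input values are hypothetical.

```python
def rogan_gladen(apparent_prevalence, sensitivity, specificity):
    """Correct an apparent (test-positive) prevalence for an imperfect test.

    Uses (p_apparent + specificity - 1) / (sensitivity + specificity - 1),
    truncated to the [0, 1] range.
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0.0:
        raise ValueError("test must be better than chance (Se + Sp > 1)")
    estimate = (apparent_prevalence + specificity - 1.0) / denominator
    return min(1.0, max(0.0, estimate))


# Hypothetical inputs: 3% of samples test positive, Se = 0.90, Sp = 0.98.
corrected = rogan_gladen(0.03, 0.90, 0.98)
print(f"corrected prevalence = {corrected:.4f}")
```

At low prevalence a small shortfall in specificity accounts for much of the raw positive fraction, which is one reason the abstract stresses modelling test characteristics explicitly.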

13

Direct and mediated effects (DME) SLCMA: a novel method for life course modelling with time-varying covariates

Beer, S.; Simpkin, A. J.; Eldeeb, S. Y.; Zar, H. J.; Stein, D. J.; Dunn, E. C.; Smith, A. D. A. C.

2026-06-06 epidemiology 10.64898/2026.05.29.26354427 medRxiv

Top 0.1%

33.3%

Background: In prospective cohort studies, where an exposure is collected repeatedly, interest often lies in determining whether the timing of that exposure has a differential effect on a later outcome. The Structured Life Course Modeling Approach (SLCMA), where users select between temporal hypotheses of exposure specified a priori, provides one way to analyse such longitudinal data. However, few studies using SLCMA consider the effect of time-varying covariates (TVC) which may impact associations. Methods: We present a modified version of the SLCMA - called direct and mediated effects (DME)-SLCMA - which corrects for TVC. We first develop the DME-SLCMA method, test it through simulation, and apply it to psychosocial data from the Drakenstein Child Health Study (DCHS, n=336) to investigate relationships between maternal psychopathology, TVC of socioeconomic status, and offspring depressive symptoms. Results: We found that, on average, offspring depressive symptoms score increased by 3.9% (95% CI: 1.0%-6.9%, p = 0.039) for each unit of maternal psychopathology (SRQ) at 48 months whilst adjusting for time-varying socioeconomic status (at 18, 30, 42 and 54 months). Our simulations identified several realistic scenarios where selections ignoring TVC - with TVC mediated exposure effects present - were prone to be incorrect, including our DCHS example. Conclusion: DME-SLCMA is a robust new approach for life course modelling in the presence of time-varying covariates. We recommend adjusting for TVC whenever possible, and, when not possible, our simulation study identified that scenarios where mediated effects are comparable, or greater, in magnitude to direct effects are most prone to confounding.

14

Variations in the results of nutritional epidemiology studies due to analytic flexibility: Application of specification curve analysis to red meat and all-cause mortality

Wang, Y.; Pitre, T.; Wallach, J. D.; de Souza, R. J.; Jassal, T.; Bier, D.; Patel, C. J.; Zeraatkar, D.

2023-12-21 epidemiology 10.1101/2023.12.19.23300248 medRxiv

Top 0.1%

32.9%

Objective: To present an application of specification curve analysis, a novel analytic method that involves defining and implementing all plausible and valid analytic approaches for addressing a research question, to nutritional epidemiology. Data source: National Health and Nutrition Examination Survey (NHANES) 2007 to 2014 linked with National Death Index. Methods: We reviewed all observational studies addressing the effect of red meat on all-cause mortality, sourced from a published systematic review, and documented variations in analytic methods (e.g., choice of model, covariates, etc.). We enumerated all defensible combinations of analytic choices to produce a comprehensive list of all the ways in which the data may reasonably be analyzed. We applied specification curve analysis to NHANES data to investigate the effect of unprocessed red meat on all-cause mortality, using all reasonable analytic specifications. Results: Among 15 publications reporting on 24 cohorts included in the systematic review on red meat and all-cause mortality, we identified 70 unique analytic methods, each including different analytic models, covariates, and operationalizations of red meat (e.g., continuous vs. quantiles). We applied specification curve analysis to NHANES, including 10,661 participants. Our specification curve analysis included 1,208 unique analytic specifications. Of 1,208 specifications, 435 (36.0%) yielded a hazard ratio equal to or above 1 for the effect of red meat on all-cause mortality and 773 (64.0%) below 1, with a median hazard ratio of 0.94 [IQR: 0.83 to 1.05]. Forty-eight specifications (3.97%) were statistically significant, 40 of which indicated unprocessed red meat to reduce all-cause mortality and 8 of which indicated red meat to increase mortality. Conclusion: We show that the application of specification curve analysis to nutritional epidemiology is feasible and presents an innovative solution to analytic flexibility. Limitations: Alternative analytic specifications may address slightly different questions and investigators may disagree about justifiable analytic approaches. Further, specification curve analysis is time and resource-intensive and may not always be feasible.
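
The enumeration step described in the Methods can be sketched as a Cartesian product over analytic choices, followed by a summary of the resulting hazard ratios. The option names and example hazard ratios below are hypothetical placeholders, not the study's actual specification grid, and the model-fitting step is deliberately left out.

```python
from itertools import product
from statistics import median

# Hypothetical analytic choices; the study's real grid is not given here.
CHOICES = {
    "model": ["cox", "survey_weighted_cox"],
    "covariates": ["minimal", "demographic", "full"],
    "exposure_form": ["continuous", "quartiles"],
}


def enumerate_specifications(choices):
    """Return one dict per combination of analytic choices."""
    keys = list(choices)
    return [dict(zip(keys, combo))
            for combo in product(*(choices[k] for k in keys))]


def summarise_hazard_ratios(hazard_ratios):
    """Share of specifications with HR below 1 and the median HR."""
    below_one = sum(1 for hr in hazard_ratios if hr < 1.0)
    return {
        "share_below_one": below_one / len(hazard_ratios),
        "median_hr": median(hazard_ratios),
    }


specs = enumerate_specifications(CHOICES)
print(len(specs), "specifications, e.g.", specs[0])
# Each specification would be fitted to the data to obtain one hazard ratio;
# the list below stands in for those fitted values.
print(summarise_hazard_ratios([0.8, 0.9, 1.1, 1.2]))
```

Keeping the grid as plain data makes it easy to report exactly which combinations were judged defensible, the point on which the Limitations note investigators may disagree.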

15

School-located influenza vaccination and community-wide indirect effects: reconciling mathematical models to epidemiologic models

Arinaminpathy, N.; Reed, C.; Biggerstaff, M.; Nguyen, A.; Athni, T. S.; Arnold, B. F.; Hubbard, A. E.; Colford, J. M.; Reingold, A.; Benjamin-Chung, J.

2022-10-13 infectious diseases 10.1101/2022.10.08.22280870 medRxiv

Top 0.1%

32.3%

Background: Mathematical models and empirical epidemiologic studies (e.g., randomized and observational studies) are complementary tools but may produce conflicting results for a given research question. We used sensitivity analyses and bias analyses to explore such discrepancies in a study of the indirect effects of influenza vaccination. Methods: We fit an age-structured, deterministic, compartmental model to estimate indirect effects of a school-based influenza vaccination program in California that was evaluated in a previous matched cohort study. To understand discrepancies in their results, we used 1) a model with constrained parameters such that projections matched the cohort study; and 2) probabilistic bias analyses to identify potential biases (e.g., outcome misclassification due to incomplete influenza testing) that, if corrected, would align the empirical results with the mathematical model. Results: The indirect effect estimate (% reduction in influenza hospitalization among older adults in intervention vs. control) was 22.3% (95% CI 7.6% - 37.1%) in the cohort study but only 1.6% (95% Bayesian credible intervals 0.4 - 4.4%) in the mathematical model. When constrained, mathematical models aligned with the cohort study when there was substantially lower pre-existing immunity among school-age children and older adults. Conversely, empirical estimates corrected for potential bias aligned with mathematical model estimates only if influenza testing rates were 15-23% lower in the intervention vs. comparison site. Conclusions: Sensitivity and bias analysis can shed light on why results of mathematical models and empirical epidemiologic studies differ for the same research question, and in turn, can improve study and model design.

16

Estimating Excess Mortality Among People Living with HIV/AIDS During the COVID-19 Pandemic in the USA

Hall, L.; Chowell, G.

2025-06-10 hiv aids 10.1101/2025.06.09.25329225 medRxiv

Top 0.1%

32.1%

Objectives: To quantify the all-cause excess death rate of people living with HIV/AIDS (PWHA) during the multi-year 2020-2022 COVID-19 pandemic in the United States (U.S.), including stratifications by sex, age, race/ethnicity, and region. Design: Using publicly available data from the CDC NCHHSTP AtlasPlus dashboard, we employed the ensemble n-subepidemic modeling framework (SubEpiPredict toolbox). This dynamic, uncertainty-aware approach was used to generate counterfactual forecasts of U.S. deaths among PWHA for 2020-2022. Methods: The models were calibrated using 12 years of pre-pandemic mortality trends (2008-2019), with the median excess death rate calculated as the difference between forecasted and observed death rates. Results were stratified by age, sex, race/ethnicity, and U.S. region. Results: Overall excess mortality among PWHA was estimated at 7,783 crude excess deaths (95% prediction interval [PI]: 5,098-10,525), corresponding to 2.77 excess deaths per 100,000 people (95% PI: 1.81-3.75), with the largest burden observed in 2021. Excess death rates were highest among males (3.39), individuals aged 55-64 years (4.94), multiracial populations (12.82), and residents of the Northeast U.S. (4.12). In contrast, the largest absolute number of excess deaths occurred among males (4,692), adults aged 65 years and older (2,560), Black/African American individuals (3,969), and residents of the Southern U.S. (4,025). Conclusions: These systematic, model-based results reveal stark heterogeneities among PWHA by exposing recent mortality patterns that may not be captured by disease-specific mortality reporting alone. These heterogeneous findings can inform future public health programming and resource allocation and support tailored interventions for vulnerable populations.
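
The excess-rate calculation described in the abstract (observed rate minus a counterfactual forecast calibrated on 2008-2019) can be illustrated with a minimal sketch. This is not the ensemble n-subepidemic framework or the SubEpiPredict toolbox: a plain linear trend is swapped in as the forecaster, the function name `excess_mortality` is hypothetical, and all rates are made-up numbers chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def excess_mortality(pre_years, pre_rates, post_years, post_rates):
    """Return (forecast, excess), where excess = observed - forecast.

    Assumption: a straight-line trend fitted to the pre-pandemic years
    stands in for the counterfactual forecast. The abstract's actual
    forecaster is an ensemble n-subepidemic model, not reproduced here.
    """
    pre_years = np.asarray(pre_years, dtype=float)
    post_years = np.asarray(post_years, dtype=float)
    origin = pre_years[0]  # center the years for a well-conditioned fit
    slope, intercept = np.polyfit(pre_years - origin, pre_rates, deg=1)
    forecast = slope * (post_years - origin) + intercept
    excess = np.asarray(post_rates, dtype=float) - forecast
    return forecast, excess

# Hypothetical death rates per 100,000 (illustrative only, not study data).
pre_years = np.arange(2008, 2020)              # 12 calibration years
pre_rates = 10.0 - 0.2 * (pre_years - 2008)    # exactly linear decline
post_years = np.array([2020, 2021, 2022])
post_rates = np.array([8.6, 9.4, 8.2])         # "observed" pandemic-era rates

forecast, excess = excess_mortality(pre_years, pre_rates, post_years, post_rates)
for year, f, e in zip(post_years, forecast, excess):
    print(f"{year}: forecast {f:.2f}, excess {e:.2f} per 100,000")
```

A real analysis would also propagate forecast uncertainty into a prediction interval for the excess, which this sketch omits.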

17

HHBayes: A Flexible Bayesian Framework for Simulating and Analyzing Household Transmission Dynamics

Li, K.; Hou, Y.; Mukherjee, B.; Pitzer, V. E.; Weinberger, D. M.

2026-04-03 infectious diseases 10.64898/2026.04.01.26349903 medRxiv

Top 0.1%

28.9%

Household transmission studies are important for understanding infectious disease transmission and evaluating interventions; however, they are frequently constrained by methodological challenges, including in study design and sample size determination, and in estimating parameters of interest after collecting the data. Existing tools often lack flexibility in modeling age-specific susceptibility, infectivity patterns, and the impact of interventions such as vaccination or prophylaxis. Here, we develop HHBayes, an open-source R package that provides a unified framework for simulating and analyzing household transmission data using Bayesian methods. The package enables researchers to: (1) simulate realistic household transmission dynamics with highly customizable variables; (2) incorporate viral load data (measured in viral copies/mL or cycle threshold values) to model time-varying infectiousness; (3) estimate age-dependent susceptibility and infectivity parameters using Hamiltonian Monte Carlo methods implemented in Stan; and (4) evaluate intervention effects through user-defined covariates that modify susceptibility or infectivity. We demonstrate the capabilities of the package through simulation studies showing accurate parameter recovery and applications to seasonal respiratory virus transmission, including the impact of vaccination and antiviral prophylaxis on household attack rates. HHBayes addresses a critical gap in infectious disease epidemiology by providing researchers with accessible tools for both prospective study design and retrospective data analysis. The flexibility of the package in handling complex household structures, time-varying infectiousness, and intervention effects makes it valuable for studying diverse pathogens.

18

Correcting for effect modification in the doubly-ranked non-linear Mendelian randomization method

Zhou, A.; Tian, H.; Patel, A.; Mason, A.; Yang, G.; Hypponen, E.; Burgess, S.

2026-01-23 epidemiology 10.64898/2026.01.22.26344640 medRxiv

Top 0.1%

28.3%

The doubly-ranked non-linear Mendelian randomization method can yield biased estimates when instrument strength varies across individuals due to gene-environment (GxE) interactions. We propose a simple strategy to mitigate this bias by modelling GxE interactions and removing the fitted GxE component from the exposure before stratification by the doubly-ranked method. In simulations, the proposed GxE correction strategy eliminated GxE-induced bias with null, linear and non-linear exposure-outcome relationships, and it did not introduce bias even when the effect modifier of the IV-exposure association was a confounder or was correlated with a mediator or collider of the exposure-outcome association. In empirical analyses of serum 25(OH)D, BMI, and LDL-C, falsification tests showed bias in the uncorrected doubly-ranked method. Under the selected panel of effect modifiers, the extent of bias attenuation achieved by GxE correction varied by exposure. GxE correction was most effective for LDL-C, with further support from analyses using negative controls (age at recruitment and sex) and coronary artery disease as a positive control. These findings provide proof-of-principle evidence that our proposed GxE correction strategy can mitigate GxE-induced bias in practice. Where applicable, we recommend implementing this GxE correction strategy as a sensitivity analysis to assess the robustness of findings from the doubly-ranked method.

19

Addressing spatial misalignment in population health research: a case study of US congressional district political metrics and county health data

Nethery, R. C.; Testa, C.; Tabb, L. P.; Hanage, W. P.; Chen, J. T.; Krieger, N.

2023-01-11 epidemiology 10.1101/2023.01.10.23284410 medRxiv

Top 0.1%

28.1%

Areal spatial misalignment, which occurs when data on multiple variables are collected using mismatched boundary definitions, is a ubiquitous obstacle to data analysis in public health and social science research. As one example, the emerging sub-field studying the links between political context and health in the United States faces significant spatial misalignment-related challenges, as the congressional districts (CDs) over which political metrics are measured and the administrative units, e.g., counties, for which health data are typically released have a complex misalignment structure. Standard population-weighted data realignment procedures can induce measurement error and invalidate inference, which has prompted the development of fully model-based approaches for analyzing spatially misaligned data. One such approach, atom-based regression models (ABRM), holds particular promise but has scarcely been used in practice due to the lack of appropriate software or examples of implementation. ABRM use "atoms", the areas created by intersecting all sets of units on which variables of interest are measured, as the units of analysis and build models for the atom-level data, treating the atom-level variables (generally unmeasured) as latent variables. In this paper, we demonstrate the feasibility and strengths of the ABRM in a case study of the association between political representatives' voting behavior (CD-level) and COVID-19 mortality rates (county-level) in a post-vaccine period. The adjusted ABRM results suggest that a more conservative voting record is associated with an increase in COVID-19 mortality rates, with estimated associations smaller in magnitude but consistent in direction with those of standard realignment methods. The results also indicate that ABRM may enable more robust confounding adjustment and more realistic uncertainty estimates, properly representing the uncertainties arising from all analytic procedures. We also implement the ABRM in modern optimized Bayesian computing programs and make our code publicly available, which may enable these methods to be more widely adopted.
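
To make the notion of "atoms" concrete, the sketch below shows the standard population-weighted realignment that the abstract contrasts with ABRM, not the ABRM itself. Each atom is the intersection of one county and one congressional district, and a CD-level metric is carried to the county level as a population-weighted average over that county's atoms. The function name `realign_to_counties`, the place labels, and every number are hypothetical.

```python
from collections import defaultdict

def realign_to_counties(atoms, cd_metric):
    """Population-weighted realignment of a CD-level metric to counties.

    atoms: iterable of (county, cd, population) tuples, one per atom,
           i.e. per county-by-district intersection (assumed known here).
    cd_metric: dict mapping each CD to its measured metric.
    Returns a dict mapping each county to the weighted average metric.
    """
    population = defaultdict(float)
    weighted_sum = defaultdict(float)
    for county, cd, pop in atoms:
        population[county] += pop
        weighted_sum[county] += pop * cd_metric[cd]
    return {c: weighted_sum[c] / population[c] for c in population}

# Hypothetical atoms: County A is split across two districts.
atoms = [
    ("County A", "CD-1", 300.0),
    ("County A", "CD-2", 100.0),
    ("County B", "CD-2", 500.0),
]
cd_metric = {"CD-1": 0.2, "CD-2": 0.6}  # made-up voting-record scores

county_metric = realign_to_counties(atoms, cd_metric)
print(county_metric)
```

The realigned values are then typically treated as if they were measured without error, which is the source of the measurement-error concern raised in the abstract; ABRM instead models the atom-level quantities as latent variables.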

20

Estimating the generation time for SARS-CoV-2 transmission using household data in the United States, December 2021 - May 2023

Chan, L. Y. H.; Morris, S. E.; Stockwell, M. S.; Bowman, N. M.; Asturias, E.; Rao, S.; Lutrick, K.; Ellingson, K. D.; Nguyen, H. Q.; Maldonado, Y.; McLaren, S. H.; Sano, E.; Biddle, J. E.; Smith-Jeffcoat, S. E.; Biggerstaff, M.; Rolfes, M. A.; Talbot, H. K.; Grijalva, C. G.; Borchering, R. K.; Mellis, A. M.; RVTN-Sentinel Study Group

2024-10-11 infectious diseases 10.1101/2024.10.10.24315246 medRxiv

Top 0.1%

27.3%

Background: Generation time, representing the interval between infection events in primary and secondary cases, is important for understanding disease transmission dynamics including predicting the effective reproduction number (Rt), which informs public health decisions. While previous estimates of SARS-CoV-2 generation times have been reported for early Omicron variants, there is a lack of data for subsequent sub-variants, such as XBB. Methods: We estimated SARS-CoV-2 generation times using data from the Respiratory Virus Transmission Network - Sentinel (RVTN-S) household transmission study conducted across seven U.S. sites from December 2021 to May 2023. The study spanned three Omicron sub-periods dominated by the sub-variants BA.1/2, BA.4/5, and XBB. We employed a Susceptible-Exposed-Infectious-Recovered (SEIR) model with a Bayesian data augmentation method that imputes unobserved infection times of cases to estimate the generation time. Findings: The estimated mean generation time for the overall Omicron period was 3.5 days (95% credible interval, CrI: 3.3-3.7). During the sub-periods, the estimated mean generation times were 3.8 days (95% CrI: 3.4-4.2) for BA.1/2, 3.5 days (95% CrI: 3.3-3.8) for BA.4/5, and 3.5 days (95% CrI: 3.1-3.9) for XBB. Interpretation: Our study provides estimates of generation times for the Omicron variant, including the sub-variants BA.1/2, BA.4/5, and XBB. These up-to-date estimates specifically address the gap in knowledge regarding these sub-variants and are consistent with earlier studies. They enhance our understanding of SARS-CoV-2 transmission dynamics by aiding in the prediction of Rt, offering insights for improving COVID-19 modeling and public health strategies. Funding: Centers for Disease Control and Prevention, and National Center for Advancing Translational Sciences.